# High-Resolution Processing
C RADIOv2 B
Other
C-RADIOv2 is a visual feature extraction model developed by NVIDIA, offering multiple size versions suitable for image understanding and dense visual tasks.

nvidia
404
8
Vit So400m Patch14 Siglip Gap 896.pali Pt
Apache-2.0
A vision model based on the SigLIP image encoder that uses global average pooling; part of the PaliGemma project.
Text-to-Image
Transformers

timm
15
1
Mini InternVL2 1B DA DriveLM
MIT
Mini-InternVL2-DA-RS is a multimodal model optimized for the remote sensing image domain, based on the Mini-InternVL architecture. It has been fine-tuned through a domain adaptation framework and demonstrates excellent performance in remote sensing image understanding tasks.
Image-to-Text
Transformers Other

OpenGVLab
61
1
Timesformer Hr Finetuned K600
TimeSformer-HR is a video action recognition model optimized for high-resolution videos and fine-tuned on the Kinetics-600 dataset.
Video Processing
Transformers

onnx-community
17
0
C RADIO
Other
A visual feature extraction model developed by NVIDIA for generating image embeddings, supporting downstream tasks such as image classification.

nvidia
398
14
Aesthetic Shadow
Aesthetic Shadow is a Vision Transformer model with 1.1 billion parameters, specifically designed for evaluating anime image quality.
Image Classification
Transformers

shadowlilac
373
26
Segformer B4 City Satellite Segmentation 1024x1024
OpenRAIL
A satellite image segmentation model based on the SegFormer architecture, designed for urban area segmentation tasks.
Image Segmentation
Transformers

ratnaonline1
110
4
Efficientnet B6
Apache-2.0
EfficientNet is a mobile-friendly, purely convolutional model that scales depth, width, and resolution uniformly using a compound coefficient. This version was trained on the ImageNet-1k dataset.
Image Classification
Transformers

google
167
0
Timesformer Hr Finetuned Ssv2
TimeSformer is a video classification model based on a spatio-temporal attention mechanism, fine-tuned on the Something Something v2 dataset.
Video Processing
Transformers

fcakyon
14
0
Timesformer Hr Finetuned K600
TimeSformer is a video understanding model based on spatio-temporal attention mechanisms. This version is a high-resolution variant fine-tuned on the Kinetics-600 dataset.
Video Processing
Transformers

fcakyon
22
0
Timesformer Hr Finetuned Ssv2
TimeSformer is a video understanding model based on spatio-temporal attention mechanisms. This version is a high-resolution variant fine-tuned on the Something Something v2 dataset.
Video Processing
Transformers

facebook
550
2
Timesformer Hr Finetuned K400
TimeSformer is a video understanding model based on spatio-temporal attention mechanisms, pre-trained and fine-tuned on the Kinetics-400 dataset.
Video Processing
Transformers

facebook
178
2
Beit Base Finetuned Ade 640 640
Apache-2.0
BEiT is a model based on the Vision Transformer (ViT) architecture, pre-trained on ImageNet-21k through self-supervised learning and fine-tuned on the ADE20k dataset for semantic image segmentation.
Image Segmentation
Transformers

microsoft
1,645
11
Segformer B0 Finetuned Cityscapes 640 1280
Other
SegFormer is a Transformer-based semantic segmentation model fine-tuned on the Cityscapes dataset, suitable for road scene segmentation tasks.
Image Segmentation
Transformers

nvidia
41
0